
fix: the cosine similarity is evaluated for top comments and bot comments are ignored #225

Open · wants to merge 17 commits into base: development

Conversation

@gentlementlegen (Member) commented Dec 26, 2024

Resolves #174

What are the changes

  • bot comments are ignored when building the prompt, which greatly reduces token usage and inaccuracy
  • if the token count for the built prompt is higher than the model limit, strip the comments down to the 10 most relevant ones based on their cosine similarity with the specification (see the sketch below)
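
A minimal TypeScript sketch of the two changes above. The helper names (`countTokens`, `similarity`, `isBot`) are hypothetical stand-ins; this is an illustration of the described behavior, not the actual module code.

```typescript
interface CommentLike {
  body: string;
  isBot: boolean; // assumed to be resolved from the comment author's type earlier in the pipeline
}

function selectCommentsForPrompt(
  specification: string,
  comments: CommentLike[],
  countTokens: (text: string) => number, // hypothetical tokenizer wrapper
  similarity: (a: string, b: string) => number, // e.g. TF-IDF cosine similarity
  tokenLimit: number
): CommentLike[] {
  // Bot comments are ignored when building the prompt.
  const humanComments = comments.filter((c) => !c.isBot);

  // If the prompt still exceeds the model's token limit, keep only the
  // 10 comments most similar to the issue specification.
  const prompt = [specification, ...humanComments.map((c) => c.body)].join("\n");
  if (countTokens(prompt) <= tokenLimit) {
    return humanComments;
  }
  return humanComments
    .map((c) => ({ c, score: similarity(specification, c.body) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 10)
    .map((e) => e.c);
}
```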

RFC @sshivaditya2019

@gentlementlegen marked this pull request as ready for review December 26, 2024 07:39
@0x4007 (Member) left a comment
Very skeptical of tfidf approach. We should go simpler and filter out more instead.

@@ -15,6 +15,9 @@ import {
 import { BaseModule } from "../types/module";
 import { ContextPlugin } from "../types/plugin-input";
 import { GithubCommentScore, Result } from "../types/results";
+import { TfIdf } from "../helpers/tf-idf";
+
+const TOKEN_MODEL_LIMIT = 124000;
@0x4007 (Member)

This depends on the model and possibly should be an environment variable because we might change models.


@0x4007 (Member)

Hard coding the 12400 doesn't seem like a solution there either

@gentlementlegen (Member Author)

@0x4007 It is not hard coded; it is configurable within the config file.
There is no API to retrieve a model's max token value, as far as I know.

@0x4007 (Member)

Line 179 is hard coded

@gentlementlegen (Member Author)

What should I do if the configuration is undefined? Should I throw an error and stop the run?

@0x4007 (Member)

Yes, if we don't have it saved in our library or collection of known amounts, then it should throw.

@gentlementlegen (Member Author)

There are no known amounts, because no API or endpoint provides this information, so I'll just throw when it is undefined.
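
A sketch of that agreed behavior, assuming a hypothetical configuration shape (the real plugin config may name these fields differently): read the limit from the configuration and abort the run when it is missing.

```typescript
interface EvaluatorConfig {
  openAi: {
    model: string;
    maxTokens?: number; // hypothetical field name for the configured context window
  };
}

function getTokenLimit(config: EvaluatorConfig): number {
  const limit = config.openAi.maxTokens;
  if (limit === undefined) {
    // No OpenAI endpoint exposes a model's context window size, so the run stops here.
    throw new Error(
      `No token limit configured for model "${config.openAi.model}"; set it in the plugin configuration.`
    );
  }
  return limit;
}
```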

@0x4007 (Member)

Manually get the numbers from their docs then

@gentlementlegen (Member Author)

The problem is that this number is arbitrary. Didn't you just ask OpenAI to raise the limits on the account we're using, with the same model as before, and they increased the limit? I'm afraid this number can't be guessed or hard-coded.

src/parser/content-evaluator-module.ts: 2 outdated review threads (resolved)
@shiv810 commented Dec 29, 2024

I don't think TF-IDF would be the best option for selecting the comments, as it only takes into account word frequency and gives no weight to semantics. A better solution might be to switch to a larger-context model like Gemini, which provides a 2 million token context window, when we reach the token limit, and to exclude bot-generated comments from the selection process.

@gentlementlegen (Member Author)
This was added as a failsafe for when we go over the limit. Gemini could be an option, but theoretically we could also go beyond its token limit (even if that is unlikely). Since we can configure models that all have different maximum token limits (and third parties could be using smaller, cheaper ones), I think it is important that the technique we use to shrink the context does not rely on an LLM.
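
For reference, a self-contained sketch of the kind of non-LLM ranking being discussed here: plain TF-IDF term weights plus cosine similarity against the specification. The actual helper lives in src/helpers/tf-idf.ts and may differ in detail; this only illustrates the idea.

```typescript
// Term frequencies for one document.
function termFrequencies(text: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const word of text.toLowerCase().match(/[a-z0-9']+/g) ?? []) {
    counts.set(word, (counts.get(word) ?? 0) + 1);
  }
  return counts;
}

// Inverse document frequency over the whole comment set plus the specification.
function inverseDocumentFrequencies(documents: string[]): Map<string, number> {
  const docCount = new Map<string, number>();
  for (const doc of documents) {
    for (const term of new Set(termFrequencies(doc).keys())) {
      docCount.set(term, (docCount.get(term) ?? 0) + 1);
    }
  }
  // Plain log(N / n); real implementations often smooth this.
  return new Map([...docCount].map(([term, n]) => [term, Math.log(documents.length / n)] as [string, number]));
}

// Cosine similarity between two documents using TF-IDF weighted vectors.
function cosineSimilarity(a: string, b: string, idf: Map<string, number>): number {
  const tfA = termFrequencies(a);
  const tfB = termFrequencies(b);
  let dot = 0, normA = 0, normB = 0;
  for (const term of new Set([...tfA.keys(), ...tfB.keys()])) {
    const weight = idf.get(term) ?? 0;
    const x = (tfA.get(term) ?? 0) * weight;
    const y = (tfB.get(term) ?? 0) * weight;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return normA && normB ? dot / Math.sqrt(normA * normB) : 0;
}
```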

@0x4007 (Member) commented Dec 29, 2024

More careful filtering of comments, like removing bot commands and comments, and possibly text summarization or concatenating multiple calls, are all more accurate approaches. TF-IDF is not the right tool for the job.

@gentlementlegen (Member Author)
The commands and bot comments are also fixed in this PR; I added this as a last resort in case that is not enough. As I said before, I don't think it should rely on an LLM itself, because third-party users could choose a tiny model like gpt-3 with only 3,000 tokens available, which would not even be capable of summarizing. I can change this to summarize each comment in one sentence, which would use one LLM call per comment, which I thought could get expensive.

@0x4007 (Member) commented Dec 29, 2024

Doing multiple calls to score everything and then concatenate results seems the most straightforward with no data loss.

@gentlementlegen (Member Author)
@0x4007 Doing the evaluation is not the problem; the problem is that the context given to the LLM gets too big. If an issue has 300 comments, the prompt would contain those 300 comments during evaluation, which would exceed the token limit, so it has to get smaller. I don't see a way to fix that with no data loss, unless you meant comparing each comment against every other comment one by one?

@0x4007 (Member) commented Dec 30, 2024

Divide into two and do 150 per call. Receive the result arrays and concatenate them together.

@gentlementlegen (Member Author) commented Dec 30, 2024

@0x4007 This is not reliable, and if a third party decides to use a model like o4-mini it would break anyway. What should be done in that case?

Plus, concatenating would not make sense. I would have to run the comment against the first 150 and then the last 150 (which would make the context inaccurate), and then probably average the results. I believe you haven't seen how comments are evaluated: to give the model context, we send all of the comments in the prompt and evaluate them against the user's comments. Refer to:

@0x4007 (Member) commented Dec 31, 2024

Surely it's a bit of a trade-off without all of the comments in one shot, but this approach seems to trade off the least.

  1. The task specification is by far the most important reference point.
  2. We can inject a new line in the prompt explaining that what follows is part 2/2 and that, due to context window limits, we had to split the conversation into multiple parts for evaluation (see the sketch after this comment).

> and if a third party decides to use a model like o4-mini it would break anyway. What should be done in that case?

Why would they complain about using a non-default model? We set the default to what we know works.
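
A rough illustration of point 2 above, with hypothetical wording and function name; it only shows how a split-notice line could be prefixed to each chunked prompt.

```typescript
// Builds the explanatory note injected at the top of each chunked prompt.
function buildChunkHeader(part: number, totalParts: number): string {
  return [
    `Note: due to context window limits, the conversation was split into ${totalParts} parts.`,
    `This prompt contains part ${part}/${totalParts}; the task specification above applies to every part.`,
  ].join("\n");
}
```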

@gentlementlegen marked this pull request as draft January 3, 2025 08:31
@gentlementlegen marked this pull request as ready for review January 4, 2025 03:49
@gentlementlegen (Member Author)
@0x4007 Changed the behavior for when the model's limit is exceeded (see the sketch below):

  • split the prompt into n chunks
  • average each comment's scores across the chunks
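
A sketch of this chunk-and-average fallback with hypothetical helper names (`evaluate` stands in for the actual call to the LLM): the conversation context is split into n chunks, the user's comments are scored against each chunk, and the per-comment scores are averaged.

```typescript
async function evaluateWithChunkedContext(
  specification: string,
  conversation: string[], // full conversation used as context
  userComments: { id: number; body: string }[], // comments being scored
  evaluate: (
    spec: string,
    context: string[],
    comments: { id: number; body: string }[]
  ) => Promise<Map<number, number>>, // hypothetical wrapper around the model call
  chunkCount: number
): Promise<Map<number, number>> {
  const chunkSize = Math.ceil(conversation.length / chunkCount);
  const totals = new Map<number, number>();
  let runs = 0;

  // Score the same user comments against each chunk of the conversation context.
  for (let i = 0; i < conversation.length; i += chunkSize) {
    const chunkScores = await evaluate(specification, conversation.slice(i, i + chunkSize), userComments);
    for (const [id, score] of chunkScores) {
      totals.set(id, (totals.get(id) ?? 0) + score);
    }
    runs++;
  }

  // Average each comment's score over the number of chunks evaluated.
  return new Map([...totals].map(([id, total]) => [id, total / runs] as [number, number]));
}
```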

.cspell.json: outdated review thread (resolved)
@gentlementlegen (Member Author) commented Jan 8, 2025

 [ 115.94 WXDAI ] 

@gentlementlegen
Contributions Overview
View | Contribution | Count | Reward
Issue | Task | 1 | 100
Issue | Specification | 1 | 15.94
Conversation Incentives
Comment | Formatting | Relevance | Priority | Reward
> ```diff@gentlementlegen perhaps we have too m…
7.97
content:
  content:
    p:
      score: 0
      elementCount: 2
    em:
      score: 0
      elementCount: 1
    a:
      score: 5
      elementCount: 1
  result: 5
regex:
  wordCount: 54
  wordValue: 0.1
  result: 2.97
Relevance: 1 | Priority: 2 | Reward: 15.94

 [ 21.956 WXDAI ] 

@0x4007
Contributions Overview
View | Contribution | Count | Reward
Issue | Comment | 1 | 0.69
Review | Comment | 19 | 21.266
Conversation Incentives
Comment | Formatting | Relevance | Priority | Reward
I think high accuracy is the best choice from your selection. I …
1.38
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 22
  wordValue: 0.1
  result: 1.38
Relevance: 0.25 | Priority: 2 | Reward: 0.69
Very skeptical of tfidf approach. We should go simpler and filte…
0.94
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 14
  wordValue: 0.1
  result: 0.94
Relevance: 0.15 | Priority: 2 | Reward: 0.282
This depends on the model and possibly should be an environment …
1.11
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 17
  wordValue: 0.1
  result: 1.11
Relevance: 0.4 | Priority: 2 | Reward: 0.888
We should also filter out slash commands? And minimized comments?
0.71
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 10
  wordValue: 0.1
  result: 0.71
Relevance: 0.3 | Priority: 2 | Reward: 0.426
I'm skeptical about this whole TFIDF approach1. The tokenizer a…
8.64
content:
  content:
    p:
      score: 0
      elementCount: 1
    ol:
      score: 1
      elementCount: 1
    li:
      score: 0.5
      elementCount: 3
  result: 2.5
regex:
  wordCount: 127
  wordValue: 0.1
  result: 6.14
Relevance: 0.35 | Priority: 2 | Reward: 6.048
Can you articulate the weaknesses or concerns
0.52
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 7
  wordValue: 0.1
  result: 0.52
Relevance: 0.25 | Priority: 2 | Reward: 0.26
Hard coding the 12400 doesn't seem like a solution there either
0.83
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 12
  wordValue: 0.1
  result: 0.83
Relevance: 0.2 | Priority: 2 | Reward: 0.332
Line 179 is hard coded
0.39
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 5
  wordValue: 0.1
  result: 0.39
Relevance: 0.15 | Priority: 2 | Reward: 0.117
Yes if we don't have it saved in our library or collection of kn…
1.28
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 20
  wordValue: 0.1
  result: 1.28
Relevance: 0.25 | Priority: 2 | Reward: 0.64
It shouldn't affect it at all. I would proceed with implicit app…
1.44
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 23
  wordValue: 0.1
  result: 1.44
Relevance: 0.1 | Priority: 2 | Reward: 0.288
Manually get the numbers from their docs then
0.59
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 8
  wordValue: 0.1
  result: 0.59
Relevance: 0.3 | Priority: 2 | Reward: 0.354
Why is this a constant? Makes more sense to use let and directly…
1.28
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 20
  wordValue: 0.1
  result: 1.28
Relevance: 0.4 | Priority: 2 | Reward: 1.024
```suggestion```
0
content:
  content: {}
  result: 0
regex:
  wordCount: 0
  wordValue: 0.1
  result: 0
Relevance: 0.05 | Priority: 2 | Reward: 0
Add more chunks if the request to OpenAI fails for being too lon…
2.1
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 36
  wordValue: 0.1
  result: 2.1
Relevance: 0.45 | Priority: 2 | Reward: 1.89
@shiv810 rfc
0.18
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 2
  wordValue: 0.1
  result: 0.18
Relevance: 0.1 | Priority: 2 | Reward: 0.036
Separate is fine then just as long as the current code is stable.
0.88
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 13
  wordValue: 0.1
  result: 0.88
Relevance: 0.25 | Priority: 2 | Reward: 0.44
More careful filtering of comments like removal of bot commands …
2.05
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 35
  wordValue: 0.1
  result: 2.05
Relevance: 0.45 | Priority: 2 | Reward: 1.845
Doing multiple calls to score everything and then concatenate re…
1.17
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 18
  wordValue: 0.1
  result: 1.17
Relevance: 0.425 | Priority: 2 | Reward: 0.995
Divide into two and do 150 each call. Receive the results array …
1.06
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 16
  wordValue: 0.1
  result: 1.06
Relevance: 0.375 | Priority: 2 | Reward: 0.795
Surely it's a bit of a trade off without all of the comments in …
6.58
content:
  content:
    p:
      score: 0
      elementCount: 1
    ol:
      score: 1
      elementCount: 1
    li:
      score: 0.5
      elementCount: 2
  result: 2
regex:
  wordCount: 90
  wordValue: 0.1
  result: 4.58
Relevance: 0.35 | Priority: 2 | Reward: 4.606

 [ 2.456 WXDAI ] 

@shiv810
Contributions Overview
View | Contribution | Count | Reward
Review | Comment | 3 | 2.456
Conversation Incentives
Comment | Formatting | Relevance | Priority | Reward
@gentlementlegen It shouldn't impact the comment evaluation at a…
1.7
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 28
  wordValue: 0.1
  result: 1.7
Relevance: 0.6 | Priority: 2 | Reward: 0.516
Besides the error code, `OpenRouter` provides `Provi…
1.38
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 22
  wordValue: 0.1
  result: 1.38
Relevance: 0.8 | Priority: 2 | Reward: 0.56
I don't think TF-IDF would be the best option for selecting the …
3.66
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 69
  wordValue: 0.1
  result: 3.66
Relevance: 0.75 | Priority: 2 | Reward: 1.38

This is the result I got while limiting the token count to 2,000, which led to comment splitting. Note that I used the default configuration, not the ubiquity-os-marketplace one. The target was ubiquity-os-marketplace/text-conversation-rewards/174.


Successfully merging this pull request may close these issues.

Check overhead for comment evaluation